Nashville Investment Project by Thu Phuong Nguyen¶

I. Introduction¶

The purpose of this assignment is to develop a data-driven solution for a real estate company seeking to invest in the Nashville area. Through data cleansing, model building, evaluation, and recommendation, the aim is to create a predictive model that accurately assesses property values and identifies instances of overpricing or underpricing. By comparing modeling techniques such as logistic regression, decision trees, random forests, gradient boosting, and neural networks, the assignment seeks to determine the most suitable approach for the problem at hand. Exploring ensemble modeling techniques further aims to enhance predictive accuracy and robustness. Ultimately, the goal is to provide actionable insights that enable the company to make informed investment decisions and maximize returns in the dynamic Nashville real estate market.

II. Analysis¶

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.model_selection import GridSearchCV, KFold
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, roc_auc_score, recall_score 
from sklearn.metrics import mean_squared_error
from sklearn import datasets
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
import datetime as dt
In [2]:
import warnings
warnings.filterwarnings("ignore")
In [3]:
housingData = pd.read_csv('Nashville_housing_data_2013_2016.csv')

Since we are looking to invest in the growing Nashville area and to build a model that accurately identifies the best deals, we will restrict the analysis to properties located within Nashville.

In [4]:
# Dataset of Nashville area
housingData = housingData[housingData['Property City'] == 'NASHVILLE']
In [5]:
housingData.shape
Out[5]:
(40280, 31)
In [6]:
housingData.info()
<class 'pandas.core.frame.DataFrame'>
Index: 40280 entries, 0 to 56635
Data columns (total 31 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Unnamed: 0.1                       40280 non-null  int64  
 1   Unnamed: 0                         40280 non-null  int64  
 2   Parcel ID                          40280 non-null  object 
 3   Land Use                           40280 non-null  object 
 4   Property Address                   40280 non-null  object 
 5   Suite/ Condo   #                   5341 non-null   object 
 6   Property City                      40280 non-null  object 
 7   Sale Date                          40280 non-null  object 
 8   Sale Price                         40280 non-null  int64  
 9   Legal Reference                    40280 non-null  object 
 10  Sold As Vacant                     40280 non-null  object 
 11  Multiple Parcels Involved in Sale  40280 non-null  object 
 12  Owner Name                         19953 non-null  object 
 13  Address                            20700 non-null  object 
 14  City                               20700 non-null  object 
 15  State                              20700 non-null  object 
 16  Acreage                            20700 non-null  float64
 17  Tax District                       20700 non-null  object 
 18  Neighborhood                       20700 non-null  float64
 19  image                              20106 non-null  object 
 20  Land Value                         20700 non-null  float64
 21  Building Value                     20700 non-null  float64
 22  Total Value                        20700 non-null  float64
 23  Finished Area                      19089 non-null  float64
 24  Foundation Type                    19088 non-null  object 
 25  Year Built                         19089 non-null  float64
 26  Exterior Wall                      19089 non-null  object 
 27  Grade                              19089 non-null  object 
 28  Bedrooms                           19078 non-null  float64
 29  Full Bath                          19176 non-null  float64
 30  Half Bath                          19074 non-null  float64
dtypes: float64(10), int64(3), object(18)
memory usage: 9.8+ MB

The dataset, restricted to properties located in Nashville, consists of 40280 samples and 31 variables. We will use only 17 of these variables for this assignment: Land Use, Sale Price, Sold As Vacant, Multiple Parcels Involved in Sale, Acreage, Tax District, Land Value, Building Value, Total Value, Finished Area, Foundation Type, Year Built, Exterior Wall, Grade, Bedrooms, Full Bath, and Half Bath.

Task 1:¶

Use proper data cleansing techniques to ensure you have the highest quality data to model this problem. Detail your process and discuss the decisions you made to clean the data.¶

Answer Task 1:¶

The dataset contains missing values, so mode and median imputation will be employed to handle them. Removing duplicate rows and outliers will help ensure the highest data quality for modeling. A correlation matrix will be computed to understand the relationships between the dependent variable and the independent variables, and unnecessary variables will be dropped. Additionally, multicollinearity will be detected and addressed using the Variance Inflation Factor (VIF).

In [7]:
dropVariable = ['Unnamed: 0.1', 'Unnamed: 0', 'Parcel ID', 'Property Address',
               'Suite/ Condo   #', 'Property City', 'Sale Date', 'Legal Reference',
               'Owner Name', 'Address', 'City', 'State', 'Neighborhood','image']
housingData = housingData.drop(dropVariable, axis=1) #drop unnecessary columns
In [8]:
housingData.info()
<class 'pandas.core.frame.DataFrame'>
Index: 40280 entries, 0 to 56635
Data columns (total 17 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Land Use                           40280 non-null  object 
 1   Sale Price                         40280 non-null  int64  
 2   Sold As Vacant                     40280 non-null  object 
 3   Multiple Parcels Involved in Sale  40280 non-null  object 
 4   Acreage                            20700 non-null  float64
 5   Tax District                       20700 non-null  object 
 6   Land Value                         20700 non-null  float64
 7   Building Value                     20700 non-null  float64
 8   Total Value                        20700 non-null  float64
 9   Finished Area                      19089 non-null  float64
 10  Foundation Type                    19088 non-null  object 
 11  Year Built                         19089 non-null  float64
 12  Exterior Wall                      19089 non-null  object 
 13  Grade                              19089 non-null  object 
 14  Bedrooms                           19078 non-null  float64
 15  Full Bath                          19176 non-null  float64
 16  Half Bath                          19074 non-null  float64
dtypes: float64(9), int64(1), object(7)
memory usage: 5.5+ MB

Missing Values¶

In [9]:
import plotly.express as px
fig = px.bar(housingData.isnull().sum().sort_values(ascending=False), color_discrete_sequence=["lightblue"])
fig.update_layout(showlegend=False, 
                  xaxis_title="",
                  yaxis_title="Missing Value",
                  title={'text': "Figure 1: Number of Missing Values for each Column",
                         'x': 0.50,  
                         'xanchor': 'center',  
                         'yanchor': 'top',
                         'font': {'size': 14}},
                  margin={'t': 100})
fig.show()

Figure 1 displays the number of missing values in each column. Missing values occur in Half Bath, Bedrooms, Foundation Type, Grade, Exterior Wall, Year Built, Finished Area, Full Bath, Total Value, Building Value, Land Value, Tax District, and Acreage.

In [10]:
housingData.describe()
Out[10]:
Sale Price Acreage Land Value Building Value Total Value Finished Area Year Built Bedrooms Full Bath Half Bath
count 4.028000e+04 20700.000000 2.070000e+04 2.070000e+04 2.070000e+04 19089.000000 19089.000000 19078.000000 19176.000000 19074.000000
mean 3.663047e+05 0.464649 7.830631e+04 1.752922e+05 2.562645e+05 1986.433778 1961.542092 3.096918 1.910670 0.297106
std 1.081598e+06 0.957274 1.140855e+05 2.260268e+05 3.052314e+05 1849.770778 27.549181 0.888494 0.996996 0.500055
min 5.000000e+01 0.010000 1.000000e+02 0.000000e+00 1.000000e+02 0.000000 1799.000000 0.000000 0.000000 0.000000
25% 1.428438e+05 0.180000 2.200000e+04 7.770000e+04 1.070000e+05 1251.250000 1945.000000 3.000000 1.000000 0.000000
50% 2.300000e+05 0.260000 3.200000e+04 1.215000e+05 1.683500e+05 1672.000000 1957.000000 3.000000 2.000000 0.000000
75% 3.600000e+05 0.450000 8.500000e+04 2.018000e+05 3.023000e+05 2293.739990 1974.000000 4.000000 2.000000 1.000000
max 5.427806e+07 51.340000 2.772000e+06 1.297180e+07 1.394040e+07 197988.000000 2017.000000 11.000000 10.000000 3.000000

Mode Imputation¶

Mode imputation efficiently handles missing values in categorical or ordinal data by replacing them with the most frequently occurring value (the mode). Replacing a missing value with the most prevalent category preserves the distribution of the feature without introducing significant bias. It is therefore applied here to 'Half Bath', 'Bedrooms', 'Foundation Type', 'Grade', 'Exterior Wall', 'Year Built', 'Full Bath', and 'Tax District', which are treated as categorical or ordinal variables.

In [11]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')
housingData[['Half Bath', 'Bedrooms', 'Foundation Type', 'Grade', 'Exterior Wall', 'Year Built', 'Full Bath', 
             'Tax District']] = imputer.fit_transform(housingData[['Half Bath', 'Bedrooms',
             'Foundation Type', 'Grade', 'Exterior Wall', 'Year Built', 'Full Bath', 'Tax District']])

Median Imputation¶

Median imputation is a suitable method for handling missing data when the data is skewed or contains outliers. Unlike mean imputation, which may be sensitive to outliers, median imputation is robust and less affected by extreme values. Therefore, using median imputation can help preserve the central tendency of the data and provide more accurate estimates, especially in the presence of skewed distributions or outliers.
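To illustrate that robustness with toy numbers (assumed for demonstration, not drawn from the Nashville data): a single extreme value drags the mean far above the bulk of the data, while the median is unaffected.

```python
import numpy as np
import pandas as pd

# Toy price column with one extreme outlier and one missing value.
prices = pd.Series([100_000, 120_000, 130_000, 150_000, 5_000_000, np.nan])

mean_fill = prices.mean()      # dragged upward by the 5M outlier
median_fill = prices.median()  # robust to the outlier

print(f"mean fill:   {mean_fill:,.0f}")    # 1,100,000
print(f"median fill: {median_fill:,.0f}")  # 130,000

# Median imputation fills the gap with a value near the bulk of the data.
filled = prices.fillna(median_fill)
```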

In [12]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
housingData[['Finished Area', 'Total Value', 'Building Value', 'Land Value', 
             'Acreage']] = imputer.fit_transform(housingData[['Finished Area', 
             'Total Value', 'Building Value', 'Land Value', 'Acreage']])
In [13]:
housingData.isnull().sum()
Out[13]:
Land Use                             0
Sale Price                           0
Sold As Vacant                       0
Multiple Parcels Involved in Sale    0
Acreage                              0
Tax District                         0
Land Value                           0
Building Value                       0
Total Value                          0
Finished Area                        0
Foundation Type                      0
Year Built                           0
Exterior Wall                        0
Grade                                0
Bedrooms                             0
Full Bath                            0
Half Bath                            0
dtype: int64

Duplicate Values¶

There are 13801 duplicated rows, which will be dropped.

In [14]:
print('Duplicate data:', housingData.duplicated().sum())
Duplicate data: 13801
In [15]:
housingData = housingData.drop_duplicates() #remove rows with duplicates
print('Duplicate data:', housingData.duplicated().sum())
Duplicate data: 0
In [16]:
housingData.info()
<class 'pandas.core.frame.DataFrame'>
Index: 26479 entries, 0 to 56633
Data columns (total 17 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Land Use                           26479 non-null  object 
 1   Sale Price                         26479 non-null  int64  
 2   Sold As Vacant                     26479 non-null  object 
 3   Multiple Parcels Involved in Sale  26479 non-null  object 
 4   Acreage                            26479 non-null  float64
 5   Tax District                       26479 non-null  object 
 6   Land Value                         26479 non-null  float64
 7   Building Value                     26479 non-null  float64
 8   Total Value                        26479 non-null  float64
 9   Finished Area                      26479 non-null  float64
 10  Foundation Type                    26479 non-null  object 
 11  Year Built                         26479 non-null  object 
 12  Exterior Wall                      26479 non-null  object 
 13  Grade                              26479 non-null  object 
 14  Bedrooms                           26479 non-null  object 
 15  Full Bath                          26479 non-null  object 
 16  Half Bath                          26479 non-null  object 
dtypes: float64(5), int64(1), object(11)
memory usage: 3.6+ MB

Outliers¶

In [17]:
variables = ['Sale Price', 'Finished Area', 'Total Value', 'Building Value', 'Land Value', 
             'Acreage']

plt.figure(figsize=(15, 10))
for i, column in enumerate(variables, 1):
    plt.subplot(3, 3, i) 
    sns.boxplot(x=column, data=housingData) 
    plt.title('Figure: Box Plots of {}'.format(column)) 
    plt.xlabel('Data')
    plt.ylabel('Values') 

plt.tight_layout()
print("Figure 2: Boxplots of 6 variables showing their outliers")
plt.show()
Figure 2: Boxplots of 6 variables showing their outliers

Figure 2 above displays boxplots of six variables, illustrating the outliers that need to be addressed. Only these six variables were selected for extreme-outlier removal, as the remaining variables are either categorical or dummy variables (taking only the values 0 and 1).

In [18]:
variables = ['Sale Price', 'Finished Area', 'Total Value', 'Building Value', 'Land Value', 
             'Acreage']
def removeOutliers(housingData, column):
    Q1 = housingData[column].quantile(0.25)
    Q3 = housingData[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return housingData[(housingData[column] >= lower_bound) & (housingData[column] <= upper_bound)]

# Remove outliers for each variable
for var in variables:
    housingData = removeOutliers(housingData, var)

# How many outliers were removed?
print("Shape after removing outliers:", housingData.shape)

plt.figure(figsize=(15, 10))
for i, column in enumerate(variables, 1):
    plt.subplot(2, 3, i) 
    sns.boxplot(x=column, data=housingData) 
    plt.title('Box Plot of {}'.format(column)) 
    plt.xlabel('Data')
    plt.ylabel('Values') 

plt.tight_layout()
print("Figure 3: Boxplots of 6 variables after removing extreme outliers")
plt.show()
Shape after removing outliers: (12436, 17)
Figure 3: Boxplots of 6 variables after removing extreme outliers

Figure 3 shows the 6 boxplots for those 6 variables after removing 'far and extreme' outliers. Removing all outliers may risk losing valuable insights or distorting the data distribution. Therefore, by specifically targeting 'far and extreme outliers' for removal, we aim to maintain a balance between data cleanliness and preserving its integrity for analysis.

In [26]:
variables = ['Sale Price', 'Total Value']
def removeOutliers(housingData, column):
    Q1 = housingData[column].quantile(0.25)
    Q3 = housingData[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return housingData[(housingData[column] >= lower_bound) & (housingData[column] <= upper_bound)]

# Remove outliers for each variable
for var in variables:
    housingData = removeOutliers(housingData, var)

# How many outliers were removed?
print("Shape after removing outliers:", housingData.shape)

plt.figure(figsize=(15, 10))
for i, column in enumerate(variables, 1):
    plt.subplot(1, 2, i) 
    sns.boxplot(x=column, data=housingData) 
    plt.title('Box Plot of {}'.format(column)) 
    plt.xlabel('Data')
    plt.ylabel('Values') 

plt.tight_layout()
print("Figure 4: Boxplots of 'Sale Price' and 'Total Value' variables after removing outliers")
plt.show()
Shape after removing outliers: (11473, 17)
Figure 4: Boxplots of 'Sale Price' and 'Total Value' variables after removing outliers

In Figure 4 above, outliers were removed from 'Sale Price' and 'Total Value' to mitigate bias when creating a new variable based on these two features.

After removing the outliers, the dataset consists of 11473 samples and 17 variables.

Preprocessing¶

In [27]:
# Plot the distribution of Sale Price
plt.figure(figsize=(10, 6))
sns.histplot(housingData['Sale Price'], kde=True)
plt.title('Figure 5: Distribution of Sale Price')
plt.xlabel('Sale Price')
plt.ylabel('Count')
plt.show()

Figure 5 displays the distribution of Sale Price. The majority of properties in the dataset have sale prices ranging from 100,000 to 200,000, with some reaching as high as 400,000. After outlier removal, the 'Sale Price' variable is approximately normally distributed.

There is a concern that houses are being sold at prices exceeding their asking prices, prompting the need to build an appropriate model to identify whether a property is overpriced or underpriced.

In [28]:
# Problem: There might be a concern that houses are going over their asking prices.
salePrice = housingData['Sale Price']
totalValue = housingData['Total Value']

# Means of sale price and total value
meanSalePrice = salePrice.mean()
meanTotalValue = totalValue.mean()

# Plot bar chart
plt.bar(['Sale Price', 'Total Value'], [meanSalePrice, meanTotalValue], color=['lightblue', 'grey'])
plt.title('Figure 6: Problem - Houses are Going over their Asking Prices')
plt.ylabel('Mean Value')
plt.show()

Figure 6 compares the mean sale price and mean total value of properties in Nashville: the blue bar represents the mean sale price, and the grey bar the mean total value. Since the mean sale price is notably higher than the mean total value, properties are, on average, selling for more than their assessed total value. This aligns with the concern that houses are going over their asking prices.

In [29]:
housingData.info()
<class 'pandas.core.frame.DataFrame'>
Index: 11473 entries, 0 to 56625
Data columns (total 17 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Land Use                           11473 non-null  object 
 1   Sale Price                         11473 non-null  int64  
 2   Sold As Vacant                     11473 non-null  object 
 3   Multiple Parcels Involved in Sale  11473 non-null  object 
 4   Acreage                            11473 non-null  float64
 5   Tax District                       11473 non-null  object 
 6   Land Value                         11473 non-null  float64
 7   Building Value                     11473 non-null  float64
 8   Total Value                        11473 non-null  float64
 9   Finished Area                      11473 non-null  float64
 10  Foundation Type                    11473 non-null  object 
 11  Year Built                         11473 non-null  object 
 12  Exterior Wall                      11473 non-null  object 
 13  Grade                              11473 non-null  object 
 14  Bedrooms                           11473 non-null  object 
 15  Full Bath                          11473 non-null  object 
 16  Half Bath                          11473 non-null  object 
dtypes: float64(5), int64(1), object(11)
memory usage: 1.6+ MB
In [30]:
# Label encoding for categorical variables in dataset
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

categoricalVar = ['Land Use', 'Sold As Vacant', 'Multiple Parcels Involved in Sale',
                    'Tax District', 'Foundation Type', 'Year Built', 'Exterior Wall',
                    'Grade', 'Bedrooms', 'Full Bath', 'Half Bath']

# Apply label encoding for each categorical column
for var in categoricalVar:
    housingData[var] = label_encoder.fit_transform(housingData[var])

New Dependent Variable¶

Sale prices higher than the total value are labeled 1, indicating overpricing; sale prices lower than the total value are labeled 0, indicating underpricing.

In [31]:
# Difference between sale price and total value
housingData['Price Dif'] = housingData['Sale Price'] - housingData['Total Value']

# Determine over/underpricing
housingData['Over Priced'] = (housingData['Price Dif'] > 0).astype(int)
housingData['Under Priced'] = (housingData['Price Dif'] < 0).astype(int)

housingData['Price Category'] = housingData['Over Priced'] - housingData['Under Priced']
In [32]:
# Create a dependent variable to understand whether it is over/under the price
# Assign 0 to underpriced and 1 to overpriced
housingData['Price Category'] = housingData['Price Category'].apply(lambda x: 1 if x == 1 else 0)

Correlation¶

In [33]:
matrix = housingData.corr()
f, ax = plt.subplots(figsize=(20, 13))
sns.heatmap(matrix, vmax=1, square=True, cmap="BuPu", annot=True)

plt.title('Figure 7: Correlation Matrix', fontsize=16)
plt.show()

The correlation matrix (Figure 7) shows the pairwise correlations between Price Category and other variables in the dataset.

  • Positive correlation coefficients, such as 0.58 for 'Sale Price' and 0.69 for 'Over Priced', indicate a moderate-to-strong positive linear relationship between the variable and Price Category. This suggests that as these variables increase, the likelihood of the property being overpriced also increases.
  • The correlation of -0.18 for 'Sold As Vacant' indicates a weak negative linear relationship with Price Category. This suggests that properties sold as vacant are slightly more likely to be underpriced.
  • Correlation coefficients close to 0, such as 0.0044 for 'Acreage' and 0.006 for 'Full Bath', indicate a very weak linear relationship between the variable and Price Category. This suggests that there is little to no correlation between these variables and the likelihood of over/underpricing. Those variables can be dropped.
In [34]:
noCorrVar = ['Acreage', 'Full Bath']
housingData = housingData.drop(noCorrVar, axis=1) #drop columns with no correlation with target variable
In [35]:
extremeCorrVar = ['Over Priced', 'Under Priced']
housingData = housingData.drop(extremeCorrVar, axis=1) #drop columns with extreme correlation with target variable

Variance Inflation Factor (VIF)¶

In [36]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(X): # Calculating VIF
    vif = pd.DataFrame() 
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    
    vif = vif.sort_values(by='VIF', ascending=False).reset_index(drop=True)
    return(vif)

X = housingData.iloc[:,:-1]
print("Table 1: VIF")
calc_vif(X)
Table 1: VIF
Out[36]:
variables VIF
0 Sale Price inf
1 Total Value inf
2 Price Dif inf
3 Building Value 544.312083
4 Land Value 82.187833
5 Tax District 55.330187
6 Finished Area 48.499363
7 Bedrooms 38.764152
8 Grade 18.210464
9 Land Use 17.860565
10 Year Built 12.111708
11 Exterior Wall 2.003503
12 Sold As Vacant 1.503993
13 Foundation Type 1.407603
14 Half Bath 1.314381
15 Multiple Parcels Involved in Sale 1.254161

Multicollinearity is a critical consideration before fitting a logistic regression, as highly correlated predictors inflate the variance of the coefficient estimates and make them unstable. To assess multicollinearity, I calculated the Variance Inflation Factor (VIF), shown in Table 1. According to Husnoo (2020), variables with a VIF score exceeding 5 exhibit strong correlation. These variables are therefore deemed highly correlated and are omitted from the logistic regression model to mitigate multicollinearity issues.
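For intuition, the VIF of feature j equals 1 / (1 - R²_j), where R²_j comes from regressing feature j on all the other features. The sketch below computes this manually with numpy on synthetic data (statsmodels' `variance_inflation_factor` does the same via OLS): a nearly collinear pair blows past the threshold of 5, while an independent feature stays near 1. This is also why 'Sale Price', 'Total Value', and 'Price Dif' show infinite VIF in Table 1: 'Price Dif' is an exact linear combination of the other two.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the rest."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # design matrix with intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)  # nearly collinear with a
c = rng.normal(size=500)                 # independent of both

X_demo = np.column_stack([a, b, c])
print([round(vif(X_demo, j), 1) for j in range(3)])  # a and b huge, c near 1
```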

In [37]:
dropVar = ['Sale Price', 'Total Value', 'Price Dif', 'Building Value', 'Land Value', 
           'Tax District', 'Bedrooms', 'Grade', 'Land Use', 'Year Built']
housing = housingData.drop(dropVar, axis=1)
In [38]:
housing.info()
<class 'pandas.core.frame.DataFrame'>
Index: 11473 entries, 0 to 56625
Data columns (total 7 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Sold As Vacant                     11473 non-null  int64  
 1   Multiple Parcels Involved in Sale  11473 non-null  int64  
 2   Finished Area                      11473 non-null  float64
 3   Foundation Type                    11473 non-null  int64  
 4   Exterior Wall                      11473 non-null  int64  
 5   Half Bath                          11473 non-null  int64  
 6   Price Category                     11473 non-null  int64  
dtypes: float64(1), int64(6)
memory usage: 717.1 KB

Task 2:¶

Build a logistic regression model to accurately identify overpricing/underpricing and determine what is driving those prices.¶

Answer Task 2:¶

The logistic regression model indicates that three variables have a significant impact on identifying overpriced/underpriced properties: 'Sold As Vacant', 'Foundation Type', and 'Finished Area'. The accuracy of the logistic model is 0.54, meaning it correctly predicts the class label for 54% of the properties, and the weighted average F1-score is 0.46, reflecting the balance between precision and recall across both classes.
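For reference, the accuracy and weighted-average F1-score quoted above are computed with scikit-learn as sketched here (the labels below are illustrative only, not the notebook's actual predictions). The 'weighted' average weights each class's F1 by its support, which matters when the classes are imbalanced.

```python
from sklearn.metrics import accuracy_score, f1_score

# Illustrative labels only, not the model's real output.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)               # fraction of correct labels
f1 = f1_score(y_true, y_pred, average='weighted')  # per-class F1, support-weighted
print(acc, round(f1, 3))
```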

In [39]:
housing['Price Category'].astype('category').value_counts() 
Out[39]:
Price Category
1    8312
0    3161
Name: count, dtype: int64

There are 8312 samples representing overpricing and only 3161 samples representing underpricing.

Data Imbalanced¶

In [40]:
# Visualize Imbalanced Data
fig, ax = plt.subplots()

ax.pie(
    housing['Price Category'].value_counts().values,
    labels=["1","0"],
    autopct="%1.1f%%",
    explode=(0, 0.1),
    shadow=True,
    colors=['lightblue', 'grey']
)

ax.set_title('Figure 8: Data Imbalanced', fontsize=12)

plt.show()

The classes 0 and 1 are imbalanced; therefore, we need to balance the data.

In [41]:
# Splitting dataset into trainset and testset
X = housing.drop('Price Category', axis=1)
y = housing['Price Category']
 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=42, stratify=y)

X_train.shape, X_test.shape, y_train.shape, y_test.shape
Out[41]:
((8031, 6), (3442, 6), (8031,), (3442,))

When stratify=y is passed to train_test_split, the function preserves the target's class proportions in both the training and testing sets. Stratification does not balance the classes: because the dataset itself is imbalanced, the same imbalance carries over into both splits. Additional techniques such as oversampling or undersampling are therefore needed to mitigate class imbalance and improve model performance. I will use random oversampling to balance the data.
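A toy sketch of both points (synthetic labels assumed; manual numpy oversampling stands in for imblearn's RandomOverSampler used below): stratification reproduces the original 80/20 ratio in each split, and oversampling then equalizes the classes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 80/20 imbalanced toy labels
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 80 + [0] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3,
                                          random_state=42, stratify=y_toy)

# Stratification preserves the 80/20 ratio in each split; it does not balance it.
print(np.bincount(y_tr))  # [14 56]
print(np.bincount(y_te))  # [ 6 24]

# Random oversampling: resample minority-class rows until the classes match.
rng = np.random.default_rng(42)
minority = np.where(y_tr == 0)[0]
extra = rng.choice(minority, size=np.bincount(y_tr).max() - len(minority))
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])
print(np.bincount(y_bal))  # [56 56]
```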

In [42]:
print('Labels count in y:', np.bincount(y))
print('Labels count in y_train:', np.bincount(y_train))
print('Labels count in y_test:', np.bincount(y_test))
Labels count in y: [3161 8312]
Labels count in y_train: [2213 5818]
Labels count in y_test: [ 948 2494]

I standardized both the training and testing sets and addressed the class imbalance in the dataset to adjust the disproportionate representation of different classes, thus optimizing the effectiveness of the models.

In [43]:
# Standardization
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
In [44]:
# Oversampling
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler()
X_train_balanced, y_train_balanced = ros.fit_resample(X_train, y_train)
X_test_balanced, y_test_balanced = ros.fit_resample(X_test, y_test)

X_train_balanced.shape, y_train_balanced.shape, X_test_balanced.shape, y_test_balanced.shape
Out[44]:
((11636, 6), (11636,), (4988, 6), (4988,))
In [45]:
# Visualize Balanced Data 
fig, ax = plt.subplots()

ax.pie(
    y_train_balanced.value_counts().values,
    labels=["1","0"],
    autopct="%1.1f%%",
    explode=(0, 0.1),
    shadow=True,
    colors=['lightblue', 'grey']
)

ax.set_title('Figure 9: Data Balanced', fontsize=12)

plt.show()
In [48]:
X.head(1)
Out[48]:
Sold As Vacant Multiple Parcels Involved in Sale Finished Area Foundation Type Exterior Wall Half Bath
0 0 0 1672.0 0 0 0

Logistic Regression (Feature Selection)¶

In [49]:
# Fit the logistic regression model
model = sm.Logit(y_train_balanced, X_train_balanced)
Logit = model.fit()
print("Table 2: Logit Regression Summary")
print(Logit.summary())
Optimization terminated successfully.
         Current function value: 0.675069
         Iterations 5
Table 2: Logit Regression Summary
                           Logit Regression Results                           
==============================================================================
Dep. Variable:         Price Category   No. Observations:                11636
Model:                          Logit   Df Residuals:                    11630
Method:                           MLE   Df Model:                            5
Date:                Sun, 24 Mar 2024   Pseudo R-squ.:                 0.02608
Time:                        19:53:08   Log-Likelihood:                -7855.1
converged:                       True   LL-Null:                       -8065.5
Covariance Type:            nonrobust   LLR p-value:                 1.012e-88
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1            -0.3544      0.022    -16.352      0.000      -0.397      -0.312
x2             0.0161      0.020      0.790      0.430      -0.024       0.056
x3            -0.0489      0.021     -2.287      0.022      -0.091      -0.007
x4             0.0556      0.020      2.783      0.005       0.016       0.095
x5             0.0190      0.021      0.900      0.368      -0.022       0.060
x6            -0.0067      0.019     -0.347      0.729      -0.045       0.031
==============================================================================

The pseudo R-squared (McFadden's) measures how much the fitted model improves the log-likelihood over a null model with no predictors. At approximately 0.02608, it suggests a low level of explanatory power. The log-likelihood, the maximized value of the likelihood function for the model, is about -7855.1.
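As a quick check, the pseudo R-squared reported by statsmodels is McFadden's 1 - LL_model / LL_null; plugging in the summary's figures reproduces the value above (up to rounding of the log-likelihoods).

```python
ll_model = -7855.1  # Log-Likelihood from the summary table
ll_null = -8065.5   # LL-Null from the summary table

# McFadden's pseudo R-squared: 1 - LL_model / LL_null
pseudo_r2 = 1 - ll_model / ll_null
print(round(pseudo_r2, 5))  # ~0.02609, matching the reported 0.02608
```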

In [50]:
# Get the coefficients from the logistic regression model
coefficients = Logit.params

# Get the absolute values of coefficients for feature importance
abs_coefficients = np.abs(coefficients)

# Calculate p-values
p_values = Logit.pvalues

# Get the feature names
feature_names = X.columns

# Create a DataFrame to store the results
results = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coefficients,
    'Absolute Coefficient': abs_coefficients,
    'P-value': p_values
})

# Sort the DataFrame by absolute coefficient values
results = results.sort_values(by='Absolute Coefficient', ascending=False)

print("Table 3: Coefficient, Abs. Coefficient, and p-values of Variables in Logistic Regression")
results
Table 3: Coefficient, Abs. Coefficient, and p-values of Variables in Logistic Regression
Out[50]:
Feature Coefficient Absolute Coefficient P-value
x1 Sold As Vacant -0.354387 0.354387 4.205507e-60
x4 Foundation Type 0.055575 0.055575 5.384554e-03
x3 Finished Area -0.048890 0.048890 2.221791e-02
x5 Exterior Wall 0.018985 0.018985 3.680185e-01
x2 Multiple Parcels Involved in Sale 0.016054 0.016054 4.295952e-01
x6 Half Bath -0.006700 0.006700 7.288390e-01

The table presents the coefficients, absolute coefficients, and p-values for each feature in the logistic regression model. Notably, "Sold As Vacant" has the largest absolute coefficient (0.354387, with a negative sign indicating a negative relationship with the outcome variable) and an extremely low p-value of 4.205507e-60 (far below 0.05), signifying high statistical significance. With p-values of 5.384554e-03 and 2.221791e-02 respectively (both below 0.05), "Foundation Type" and "Finished Area" are also statistically significant. Conversely, "Half Bath" has the smallest absolute coefficient (0.006700) and a high p-value of 0.729, suggesting little influence and no statistical significance in predicting the outcome variable. "Multiple Parcels Involved in Sale" and "Exterior Wall" likewise have high p-values (0.430 and 0.368), indicating that they are insignificant in the model. In summary, three variables are significant, as follows.

In [51]:
# p-values less than 0.05
significant_coefficients = coefficients[p_values < 0.05]
significant_abs_coefficients = np.abs(significant_coefficients)
significant_feature_names = feature_names[p_values < 0.05]

# Sort features based on their absolute coefficients
sorted_indices = np.argsort(significant_abs_coefficients)
sorted_features = significant_feature_names[sorted_indices]
sorted_coefficients = significant_abs_coefficients[sorted_indices]

# Bar chart for feature importance
plt.figure(figsize=(10, 2))
plt.barh(sorted_features, sorted_coefficients, color='lightblue')
plt.title('Figure 10: Feature Importance (Logistic Regression)')
plt.xlabel('Absolute Coefficient')
plt.tight_layout()
plt.show()

Figure 10 illustrates the feature importance of the logistic regression model, in which 'Sold As Vacant' is the most important feature, significantly higher than 'Foundation Type' and 'Finished Area'. Because the outcome is the binary price category rather than the sale price itself, the coefficient of -0.354387 for 'Sold As Vacant' means that each one-unit increase in the encoded 'Sold As Vacant' feature decreases the log-odds of a property being classified as overpriced by approximately 0.354, holding the other features constant.
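Because logit coefficients live on the log-odds scale, exponentiating them gives a more intuitive odds ratio. A small illustrative check using the reported coefficient (not part of the original notebook):

```python
import numpy as np

# Logit coefficients are on the log-odds scale; exponentiating converts
# them to odds ratios. Using the reported 'Sold As Vacant' coefficient:
coef_sold_as_vacant = -0.354387
odds_ratio = np.exp(coef_sold_as_vacant)
print(round(odds_ratio, 3))  # → 0.702
```

In other words, each one-unit increase in the encoded 'Sold As Vacant' feature multiplies the odds of a property being labeled overpriced by about 0.70, holding the other features fixed.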

In [169]:
# Make predictions
y_pred = Logit.predict(X_test_balanced)
In [178]:
# Scatterplot of Predicted vs. Actual Values
plt.scatter(y_pred, y_test_balanced, color='lightblue')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Figure 11: Scatter Plot of Predicted vs Actual Values of Logistic Regression')
plt.show()

Logistic Regression¶

The objective behind hyperparameter tuning for a logistic regression classifier through grid search with cross-validation is to enhance the classifier's performance by identifying the most effective combination of hyperparameters (Okamura, 2020). This iterative process systematically explores various parameter values and assesses the model's performance using cross-validation, aiming to enhance both accuracy and generalization capabilities. Moreover, the accompanying plot offers a visual representation of the cross-validation results, shedding light on the model's performance under different parameter configurations.

In [53]:
modelsResult = pd.DataFrame({
    'Model': [],
    'Accuracy': [],
    'Precision': [],
    'Recall': []
})
In [54]:
def concat_result(df, y_pred, model):
    newModel = pd.DataFrame({
        'Model': [model],
        'Accuracy': [accuracy_score(y_pred=y_pred, y_true=y_test_balanced)],
        'Precision': [precision_score(y_pred=y_pred, y_true=y_test_balanced)],
        'Recall': [recall_score(y_pred=y_pred, y_true=y_test_balanced)]
    })
    
    modelsResult = pd.concat([df, newModel], axis=0, ignore_index=True)
    
    return modelsResult
In [55]:
# Hyperparameters for logistic regression
lr_params = {
    "penalty": ['l1', 'l2'],
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
    "solver": ['saga', 'liblinear']
}
In [56]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)
In [57]:
# GridSearchCV
clf_lr = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid=lr_params,
    scoring='accuracy',
    cv=kf
)

clf_lr.fit(X_train_balanced, y_train_balanced)

print(f"Best hyperparameters of Logistic Regression: \n{clf_lr.best_estimator_}")
Best hyperparameters of Logistic Regression: 
LogisticRegression(C=0.001, max_iter=1000, solver='saga')
In [58]:
# Make predictions
y_pred_lr = clf_lr.predict(X_test_balanced)
In [179]:
# Confusion Matrix
cm_lr = confusion_matrix(y_test_balanced,y_pred_lr)
cmap = sns.light_palette("lightblue", as_cmap=True)
sns.heatmap(cm_lr, annot=True, fmt="d", cmap=cmap)
plt.title('Figure 12: Confusion Matrix of Logistic Regression')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

The confusion matrix shows how well the classifier distinguishes underpriced (0) from overpriced (1) properties, where the label is based on the difference between sale price and total value. The model correctly identified 365 underpriced properties (true negatives) but misclassified 144 overpriced properties as underpriced (false negatives). It also mislabeled 2129 underpriced properties as overpriced (false positives), while correctly identifying 2350 overpriced properties (true positives). Despite its strength in detecting overpriced properties, the model's tendency to misclassify underpriced ones warrants further refinement to support more informed real estate investment decisions.
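The headline metrics for the positive class can be verified directly from these four counts (an illustrative check using the values read off the matrix, not part of the original notebook):

```python
# Counts read off the confusion matrix above
# (rows = true class, columns = predicted class).
tn, fn = 365, 144     # true 0 predicted 0; true 1 predicted 0
fp, tp = 2129, 2350   # true 0 predicted 1; true 1 predicted 1

total = tn + fn + fp + tp                 # 4988 test samples
accuracy = (tn + tp) / total              # ≈ 0.544
precision = tp / (tp + fp)                # ≈ 0.525 (class 1)
recall = tp / (tp + fn)                   # ≈ 0.942 (class 1)
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.674

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```

These reproduce the logistic regression scores reported later (accuracy 0.544, precision 0.525, recall 0.942).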

In [166]:
# Import classification report
from sklearn.metrics import classification_report

print('Table 4: Classification Report of the Logistic Regression \n')
print(classification_report(y_test_balanced,y_pred_lr))
Table 4: Classification Report of the Logistic Regression 

              precision    recall  f1-score   support

           0       0.72      0.15      0.24      2494
           1       0.52      0.94      0.67      2494

    accuracy                           0.54      4988
   macro avg       0.62      0.54      0.46      4988
weighted avg       0.62      0.54      0.46      4988

Table 4 shows the classification report, providing insight into the model's performance in predicting underpriced (0) and overpriced (1) properties. For underpriced properties (0), the precision is 0.72, meaning that when the model predicts a property as underpriced, it is correct 72% of the time; however, the recall is only 0.15, so the model identifies just 15% of all underpriced properties. For overpriced properties (1), the precision is 0.52, while the recall is much higher at 0.94, meaning the model successfully identifies 94% of all overpriced properties. Overall accuracy is 0.54, so the model predicts the correct class for 54% of the properties, and the weighted-average F1-score (the harmonic mean of precision and recall) is only 0.46, reflecting the model's weak performance on the underpriced class.

In [60]:
modelsResult = concat_result(modelsResult, y_pred_lr, 'Logistic Regression')

Task 3:¶

Build a decision tree model.¶

Answer Task 3:¶

The Decision Tree Classifier attained an accuracy of 0.549, precision of 0.565, and recall of 0.422.

In [61]:
# Hyperparameters for decision tree classifier
tree_params = {
    'criterion': ["gini", "entropy"],
    'splitter': ["best", "random"],
    'min_samples_split': [2, 3, 5]
}
In [62]:
# GridSearchCV
clf_tree = GridSearchCV(
    estimator=DecisionTreeClassifier(),
    param_grid=tree_params,
    scoring='accuracy',
    cv=kf
)

clf_tree.fit(X_train_balanced, y_train_balanced)

print(f"Best hyperparameters of Decision Tree model: \n{clf_tree.best_estimator_}")
Best hyperparameters of Decision Tree model: 
DecisionTreeClassifier(criterion='entropy')
In [180]:
# Retrieve the best decision tree estimator from GridSearchCV
best_tree = clf_tree.best_estimator_

# Get feature importances from the best decision tree model
feature_importances = best_tree.feature_importances_
feature_names = X.columns

# Sort feature importances
sorted_indices = np.argsort(feature_importances)
sorted_features = feature_names[sorted_indices]
sorted_feature_importances = feature_importances[sorted_indices]

# Plot feature importances
plt.figure(figsize=(10, 5))
plt.barh(sorted_features, sorted_feature_importances, color='lightblue')
plt.title('Figure 13: Feature Importance (Decision Tree)')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()

Figure 13 illustrates the six important features of the decision tree model: 'Finished Area', 'Exterior Wall', 'Foundation Type', 'Sold As Vacant', 'Half Bath', and 'Multiple Parcels Involved in Sale'. Notably, 'Finished Area' emerges as nearly 7 times more important than the other variables.

In [64]:
# Make predictions
y_pred_tree = best_tree.predict(X_test_balanced)
In [181]:
# Confusion Matrix
cm_tree = confusion_matrix(y_test_balanced, y_pred_tree)
cmap = sns.light_palette("lightblue", as_cmap=True)
sns.heatmap(cm_tree, annot=True, fmt="d", cmap=cmap)
plt.title('Figure 14: Confusion Matrix of Decision Tree Classifier')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

The model correctly predicted 1685 instances (TN) as class 0 when they were indeed class 0, indicating correct classifications of properties that are not overpriced. While the model performs reasonably well in identifying overpriced properties (1052 True Positives), it struggles with misclassifying some overpriced properties as not overpriced (1442 False Negatives) and incorrectly labeling some not overpriced properties as overpriced (809 False Positives). This indicates areas where the model's predictions may be improved to enhance its overall performance in accurately identifying overpriced properties.

In [66]:
modelsResult = concat_result(modelsResult, y_pred_tree, 'Decision Tree Classifier')

Task 4:¶

Build a Random Forest model.¶

Answer Task 4:¶

The Random Forest Classifier achieved an accuracy of 0.548, precision of 0.562, and recall of 0.438.

In [67]:
# Hyperparameters for random forest classifier
rf_params = {
    "n_estimators": [70, 90, 110],
    "criterion": ['gini', 'entropy'],
    'min_samples_split': [2, 3, 5]
}
In [68]:
# GridSearchCV
clf_rf = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid=rf_params,
    scoring='accuracy',
    cv=kf
)

clf_rf.fit(X_train_balanced, y_train_balanced)

print(f"Best hyperparameters of Random Forest model: \n{clf_rf.best_estimator_}")
Best hyperparameters of Random Forest model: 
RandomForestClassifier(min_samples_split=3, n_estimators=70)
In [182]:
# Retrieve the best random forest estimator from GridSearchCV
best_rf = clf_rf.best_estimator_

# Get feature importances from the best random forest model
feature_importances = best_rf.feature_importances_
feature_names = X.columns

# Sort feature importances
sorted_indices = np.argsort(feature_importances)
sorted_features = feature_names[sorted_indices]
sorted_feature_importances = feature_importances[sorted_indices]

# Plot feature importances
plt.figure(figsize=(10, 5))
plt.barh(sorted_features, sorted_feature_importances, color='lightblue')
plt.title('Figure 15: Feature Importance (Random Forest Model)')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()

Figure 15 illustrates the six important features of the random forest model: 'Finished Area', 'Sold As Vacant', 'Exterior Wall', 'Foundation Type', 'Half Bath', and 'Multiple Parcels Involved in Sale'. Remarkably, 'Finished Area' stands out as over 8 times more important than the other variables, while 'Half Bath' and 'Multiple Parcels Involved in Sale' contribute relatively little.
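Impurity-based importances from `feature_importances_` can be biased toward features with many distinct values; permutation importance offers a model-agnostic cross-check. A self-contained sketch on synthetic data (the dataset and variable names here are illustrative stand-ins, not the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Illustrative only: a six-feature synthetic stand-in for the housing data.
X_demo, y_demo = make_classification(n_samples=500, n_features=6,
                                     n_informative=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

rf = RandomForestClassifier(n_estimators=70, min_samples_split=3,
                            random_state=42).fit(X_tr, y_tr)

# Permutation importance: how much test accuracy drops when each
# feature's values are randomly shuffled, averaged over n_repeats.
perm = permutation_importance(rf, X_te, y_te, n_repeats=10,
                              random_state=42, scoring='accuracy')
print(perm.importances_mean.round(4))
```

Applied to the notebook's model, the same call on `best_rf` with `X_test_balanced` and `y_test_balanced` would provide an independent ranking to compare against Figure 15.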

In [70]:
# Make predictions
y_pred_rf = best_rf.predict(X_test_balanced)
In [183]:
# Confusion Matrix
cm_rf = confusion_matrix(y_pred=y_pred_rf, y_true=y_test_balanced)
cmap = sns.light_palette("lightblue", as_cmap=True)
sns.heatmap(cm_rf, annot=True, fmt="d", cmap=cmap)
plt.title('Figure 16: Confusion Matrix of Random Forest Classifier')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

Based on Figure 16, the model performed reasonably well at identifying properties that are not overpriced (TN: 1642) but struggled to correctly identify overpriced properties (TP: 1092), with a significant number of misclassifications in both directions (FP: 852 and FN: 1402). This suggests that while the model has some effectiveness, there is room for improvement, particularly in reducing false predictions.

In [72]:
modelsResult = concat_result(modelsResult, y_pred_rf, 'Random Forest Classifier')

Task 5:¶

Build a Gradient Boost model.¶

Answer Task 5:¶

The Gradient Boosting Classifier achieved an accuracy of 0.554, precision of 0.566, and recall of 0.461.

In [73]:
# Hyperparameters for gradient boosting classifier
gb_params = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.01, 0.1, 0.5],
    'max_depth': [3, 5, 7]
}
In [74]:
# GridSearchCV
clf_gb = GridSearchCV(
    estimator=GradientBoostingClassifier(),
    param_grid=gb_params,
    scoring='accuracy',
    cv=kf
)

clf_gb.fit(X_train_balanced, y_train_balanced)

print(f"Best hyperparameters of Gradient Boost model: \n{clf_gb.best_estimator_}")
Best hyperparameters of Gradient Boost model: 
GradientBoostingClassifier(learning_rate=0.5, max_depth=7, n_estimators=150)
In [184]:
# Retrieve the best gradient boosting estimator from GridSearchCV
best_gb = clf_gb.best_estimator_

# Get feature importances from the best gradient boost model
feature_importances = best_gb.feature_importances_
feature_names = X.columns

# Sort feature importances
sorted_indices = np.argsort(feature_importances)
sorted_features = feature_names[sorted_indices]
sorted_feature_importances = feature_importances[sorted_indices]

# Plot feature importances
plt.figure(figsize=(10, 5))
plt.barh(sorted_features, sorted_feature_importances, color='lightblue')
plt.title('Figure 17: Feature Importance (Gradient Boost Model)')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()

Figure 17 shows the six important features of the gradient boosting model: 'Finished Area', 'Exterior Wall', 'Foundation Type', 'Sold As Vacant', 'Half Bath', and 'Multiple Parcels Involved in Sale'. While 'Multiple Parcels Involved in Sale' contributes little, 'Finished Area' stands out as roughly six to seven times more important than the other variables.

In [76]:
# Make predictions
y_pred_gb = clf_gb.predict(X_test_balanced)
In [185]:
# Confusion Matrix
cm_gb = confusion_matrix(y_pred=y_pred_gb, y_true=y_test_balanced)
cmap = sns.light_palette("lightblue", as_cmap=True)
sns.heatmap(cm_gb, annot=True, fmt="d", cmap=cmap)
plt.title('Figure 18: Confusion Matrix of Gradient Boosting Classifier')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

The model correctly predicted 1612 instances (TN) as class 0 when they were indeed class 0, indicating accurate classifications of properties that are not overpriced. The model demonstrates reasonably good performance in identifying overpriced properties, as indicated by the high number of TP (1149). However, it still struggles with misclassifying some overpriced properties as not overpriced (1345 FN) and incorrectly labeling some not overpriced properties as overpriced (882 FP). These areas of misclassification highlight potential areas for improvement to enhance the model's accuracy in identifying overpriced properties more effectively.

In [78]:
modelsResult = concat_result(modelsResult, y_pred_gb, 'Gradient Boosting Classifier')

Task 6:¶

Build a Neural Network model.¶

Answer Task 6:¶

The Neural Network model achieved an accuracy of 0.562, precision of 0.566, and recall of 0.534.

In [115]:
# Hyperparameters for neural network
nn_params = {
    'hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 100)],
    'activation': ['relu', 'tanh', 'logistic'],
    'solver': ['adam', 'sgd'],
    'learning_rate_init': [0.001, 0.01, 0.1]
}
In [117]:
from sklearn.model_selection import RandomizedSearchCV
# RandomizedSearchCV: samples 10 random hyperparameter combinations
# rather than exhaustively searching the full grid
random_search_nn = RandomizedSearchCV(
    estimator=MLPClassifier(),
    param_distributions=nn_params,
    scoring='accuracy',
    cv=kf,
    n_iter=10
)

random_search_nn.fit(X_train_balanced, y_train_balanced)

print(f"Best hyperparameters of Neural Network model: \n{random_search_nn.best_estimator_}")
Best hyperparameters of Neural Network model: 
MLPClassifier(hidden_layer_sizes=(100, 100), learning_rate_init=0.1,
              solver='sgd')
In [118]:
# Make predictions
y_pred_nn = random_search_nn.predict(X_test_balanced)
In [119]:
# Confusion Matrix
cm_nn = confusion_matrix(y_pred=y_pred_nn, y_true=y_test_balanced)
cmap = sns.light_palette("lightblue", as_cmap=True)
sns.heatmap(cm_nn, annot=True, fmt="d", cmap=cmap)
plt.title('Figure 19: Confusion Matrix of Neural Network')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
In [120]:
modelsResult = concat_result(modelsResult, y_pred_nn, 'Neural Network')

Task 7:¶

Use multiple benchmark evaluation metrics to compare and contrast the five models. Based on your findings, provide evidence of which model you believe the real estate company should use, the key variables to focus on to drive value, and how they can get the most value out of the houses they should be targeting.¶

Remember, the goal is to help the company make money and solve the problem of what variables to consider in good value properties; building an accurate model doesn't guarantee more money.¶

Based on the evaluation metrics, the real estate company should consider using the Logistic Regression model due to its high recall score of 0.942, indicating its effectiveness in identifying overpriced properties. In that model, "Sold As Vacant," "Foundation Type," and "Finished Area" are the statistically significant drivers of the price category. Leveraging the model's predictions, the company can target properties flagged as overpriced for price negotiation and focus acquisitions on properties whose significant characteristics suggest they are underpriced, maximizing value and investment returns.

In [200]:
modelsResult5 = modelsResult.sort_values(by='Accuracy', ascending=False)
print("Table 5: Accuracy, Precision, and Recall of 5 Models (Sorted by Accuracy)")
modelsResult5
Table 5: Accuracy, Precision, and Recall of 5 Models (Sorted by Accuracy)
Out[200]:
Model Accuracy Precision Recall
4 Neural Network 0.562149 0.565845 0.534082
3 Gradient Boosting Classifier 0.553528 0.565731 0.460706
1 Decision Tree Classifier 0.548717 0.565287 0.421812
2 Random Forest Classifier 0.548115 0.561728 0.437851
0 Logistic Regression 0.544306 0.524671 0.942261
In [201]:
# Sort modelsResult by Recall
modelsResult_sorted = modelsResult.sort_values(by='Recall', ascending=True)

bar_width = 0.25
num_models = len(modelsResult_sorted)

index = np.arange(num_models) # Set the positions for the bars

# Plot
plt.figure(figsize=(12, 8))
plt.barh(index, modelsResult_sorted['Accuracy'], bar_width, color='skyblue', label='Accuracy')
plt.barh(index + bar_width, modelsResult_sorted['Precision'], bar_width, color='lightgreen', label='Precision')
plt.barh(index + 2*bar_width, modelsResult_sorted['Recall'], bar_width, color='grey', label='Recall')

plt.xlabel('Score')
plt.title('Figure 20: Accuracy, Precision, and Recall of 5 Models (Sorted by Recall)')
plt.yticks(index + bar_width, modelsResult_sorted['Model'])
plt.legend()

plt.show()

Figure 20 displays the accuracy, precision, and recall scores for the 5 models. These models are sorted by recall, as their accuracy and precision do not vary significantly. It is evident that logistic regression has the highest recall score compared to the other 4 models.

Task 8 (bonus):¶

Create an ensemble of the models trained above using the majority voting approach. Compare evaluation metrics with those of the individual models.¶

The ensemble model slightly outperforms individual models in accuracy and precision but lags in recall compared to logistic regression, crucial for identifying overpriced properties. While the ensemble combines multiple models' strengths, logistic regression's high recall makes it more preferable for pinpointing overpriced properties, ensuring better profitability for the real estate company.
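Hard voting simply takes the most common predicted label across the constituent models. A minimal sketch of the mechanism with toy binary predictions (illustrative only; the actual ensemble below uses sklearn's `VotingClassifier`):

```python
import numpy as np

# Hard (majority) voting: each model casts one vote per sample and the
# most common label wins. Toy 0/1 predictions from three models:
preds = np.array([
    [1, 0, 1, 1],   # model A
    [1, 1, 0, 1],   # model B
    [0, 0, 1, 1],   # model C
])

# With binary labels, the majority label is 1 whenever more than half
# of the models predict 1 for that sample.
votes = preds.sum(axis=0)
majority = (votes > preds.shape[0] / 2).astype(int)
print(majority)  # [1 0 1 1]
```

`VotingClassifier(voting='hard')` generalizes this to any number of classes and models, which is what the cell below fits on the balanced training data.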

In [97]:
from sklearn.ensemble import VotingClassifier
# Define the ensemble using VotingClassifier
ensemble_model = VotingClassifier(
    estimators=[
        ('logistic', clf_lr),
        ('decision_tree', clf_tree),
        ('random_forest', clf_rf),
        ('gradient_boosting', clf_gb),
        ('neural_network', random_search_nn)
    ],
    voting='hard'  # majority voting
)
# Fit the ensemble model
ensemble_model.fit(X_train_balanced, y_train_balanced)
Out[97]:
VotingClassifier(estimators=[('logistic',
                              GridSearchCV(cv=KFold(n_splits=5, random_state=42, shuffle=True),
                                           estimator=LogisticRegression(max_iter=1000),
                                           param_grid={'C': [0.001, 0.01, 0.1,
                                                             1, 10, 100],
                                                       'penalty': ['l1', 'l2'],
                                                       'solver': ['saga',
                                                                  'liblinear']},
                                           scoring='accuracy')),
                             ('decision_tree',
                              GridSearchCV(cv=KFold(n_splits=5, random_state=42, shuffle=True),
                                           estima...
                                                       'n_estimators': [50, 100,
                                                                        150]},
                                           scoring='accuracy')),
                             ('neural_network',
                              RandomizedSearchCV(cv=KFold(n_splits=5, random_state=42, shuffle=True),
                                                 estimator=MLPClassifier(),
                                                 param_distributions={'activation': ['relu',
                                                                                     'tanh',
                                                                                     'logistic'],
                                                                      'hidden_layer_sizes': [(50,),
                                                                                             (100,),
                                                                                             (50,
                                                                                              50),
                                                                                             (100,
                                                                                              100)],
                                                                      'learning_rate_init': [0.001,
                                                                                             0.01,
                                                                                             0.1],
                                                                      'solver': ['adam',
                                                                                 'sgd']},
                                                 scoring='accuracy'))])
In [131]:
# Make predictions using the ensemble model
y_pred_ensemble = ensemble_model.predict(X_test_balanced)
In [203]:
modelsResult = concat_result(modelsResult, y_pred_ensemble, 'Ensemble')
In [156]:
modelsResult6 = modelsResult.sort_values(by='Accuracy', ascending=False)
print("Table 6: Accuracy, Precision, and Recall of 6 Models (Sorted by Accuracy)")
modelsResult6
Table 6: Accuracy, Precision, and Recall of 6 Models (Sorted by Accuracy)
Out[156]:
Model Accuracy Precision Recall
4 Neural Network 0.562149 0.565845 0.534082
5 Ensemble 0.558741 0.569729 0.479952
3 Gradient Boosting Classifier 0.553528 0.565731 0.460706
1 Decision Tree Classifier 0.548717 0.565287 0.421812
2 Random Forest Classifier 0.548115 0.561728 0.437851
0 Logistic Regression 0.544306 0.524671 0.942261

The table above summarizes the performance metrics (accuracy, precision, and recall) of the six models, sorted by accuracy:

  • Neural Network: the highest accuracy of all models at about 56.21%, with a precision of about 56.58% and a recall of about 53.41%, giving the best balance between precision and recall among the individual models.
  • Ensemble: combining the five models' predictions through majority voting yields an accuracy of about 55.87%, a precision of about 56.97%, and a recall of about 48.00%, improving on most individual models while keeping precision and recall reasonably balanced.
  • Gradient Boosting Classifier: an accuracy of about 55.35%, a precision of about 56.57%, and a recall of about 46.07%, the best of the three tree-based models.
  • Decision Tree Classifier: an accuracy of about 54.87%, a precision of about 56.53%, and a recall of about 42.18%; its precision is well above its recall, so it labels positives accurately but misses many of them.
  • Random Forest Classifier: an accuracy of about 54.81%, a precision of about 56.17%, and a recall of about 43.79%, a profile very similar to the decision tree's.
  • Logistic Regression: an accuracy of about 54.43% and a precision of about 52.47%, but a recall of about 94.23%. It captures the vast majority of positive instances (overpriced houses) at the cost of misclassifying many negative instances (underpriced houses) as positive.
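Since accuracy and precision vary little across the models, the F1 score (the harmonic mean of precision and recall) is one way to fold recall into a single comparison. An illustrative computation from the table's reported values (not part of the notebook's output):

```python
# F1 = harmonic mean of precision and recall, computed from the
# reported per-model scores (illustrative check only).
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

f1_nn = f1(0.565845, 0.534082)   # Neural Network
f1_lr = f1(0.524671, 0.942261)   # Logistic Regression
print(round(f1_nn, 3), round(f1_lr, 3))  # → 0.55 0.674
```

By this combined measure, logistic regression's very high recall outweighs its lower precision, which is consistent with the recommendation to prefer it for flagging overpriced properties.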
In [204]:
# Sort modelsResult by Recall
modelsResult_sorted = modelsResult.sort_values(by='Recall', ascending=True)

bar_width = 0.25

num_models = len(modelsResult_sorted) # Get the number of models
index = np.arange(num_models)

# Plot
plt.figure(figsize=(12, 8))
plt.barh(index, modelsResult_sorted['Accuracy'], bar_width, color='skyblue', label='Accuracy')
plt.barh(index + bar_width, modelsResult_sorted['Precision'], bar_width, color='lightgreen', label='Precision')
plt.barh(index + 2*bar_width, modelsResult_sorted['Recall'], bar_width, color='grey', label='Recall')

plt.xlabel('Score')
plt.title('Figure 21: Summary of Accuracy, Precision, and Recall of Models (Sorted by Recall)')
plt.yticks(index + bar_width, modelsResult_sorted['Model'])
plt.legend()

plt.show()

Figure 21 illustrates the accuracy, precision, and recall scores of the six models. Because the models have roughly similar accuracy and precision, I compare them on recall. As we can observe, logistic regression has by far the highest recall, at nearly 94.23%, even though its accuracy and precision are slightly lower than those of the other models.

Based on the evaluation metrics, the ensemble model performs slightly better than individual models in terms of accuracy and precision, with an accuracy of 0.559 and precision of 0.570. However, the ensemble model's recall score is lower than that of the logistic regression model, indicating that it may not be as effective in identifying overpriced properties. Nonetheless, the ensemble approach combines the strengths of multiple models, providing a more robust prediction framework. Therefore, while the ensemble model improves overall performance, the logistic regression model remains preferable for its high recall score in identifying overpriced properties.

III. Conclusion¶

Based on the evaluation of various machine learning models, including logistic regression, decision trees, random forest, gradient boost, neural network, and ensemble methods, it is evident that the models perform differently in terms of accuracy, precision, and recall. While the logistic regression model exhibits the highest recall, indicating its effectiveness in identifying overpriced properties, the ensemble model shows improved accuracy and precision. Considering the assignment's objective to develop a data-driven solution for a real estate company seeking to invest in the Nashville area, the ensemble model offers a balance between overall performance and the ability to detect overpricing. By leveraging ensemble techniques and the insights gained from model evaluation, the real estate company can make informed investment decisions and maximize returns in the dynamic Nashville real estate market.

IV. References¶

Husnoo, A. (2020, November 13). A practical guide to logistic regression in Python for beginners. Medium. https://medium.com/analytics-vidhya/a-practical-guide-to-logistic-regression-in-python-for-beginners-f04cf6b63d33

Okamura, S. (2020, December 30). GRIDSEARCHCV for beginners. Medium. https://towardsdatascience.com/gridsearchcv-for-beginners-db48a90114ee

scikit-learn developers. (n.d.). sklearn.model_selection.GridSearchCV. scikit-learn documentation. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html